14 research outputs found

    Relevance Prediction in Information Extraction using Discourse and Lexical Features

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 114-121. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16955

    Matrix Factorization for Learning Metagenomic Pathways and Species

    This work considers learning meaningful sets of chemical reactions called pathways and groups of species called Operational Taxonomical Units (OTUs) from metagenomic data. The methods are based on Nonnegative Matrix Factorization (NMF). The rows of our data matrix correspond to metagenomic samples and columns correspond to chemical reactions present in the samples. In order to learn both pathways and OTUs as well as relationships between them, we consider ways to factorize the data matrix into three factors instead of two. Denoting the samples times reactions data matrix by V, our factorization problem setting is to find nonnegative matrices W, H and P so that V is approximately WHP. The matrix W tells what OTUs are present in each of the samples, P defines pathways as combinations of reactions while H describes what pathways are implemented by which OTUs. We first discuss two standard NMF algorithms based on different objective functions and four sparsity constrained variants. Sparsity constrained variants are designed to produce output matrices with few values significantly above zero. We are interested in sparser variants because metagenomic pathways are short, thus the method should find a representation where only a small set of reactions is present in each pathway. We describe how using a standard two-factor NMF method twice yields a three-factor representation. We briefly mention an existing method, Nonnegative Matrix Tri-factorization (NMTF), that learns all three matrices W, H and P simultaneously. However, this method applies hard orthogonality constraints, i.e. it only finds solutions where the matrices W and P are orthogonal. Because of this constraint, NMTF is not suitable in our biological problem setting. We introduce an unconstrained method called NMF3 as well as a sparsity constrained variant SNMF3 based on Sparse Nonnegative Matrix Factorization (SNMF) and show how both of these algorithms can be derived. 
To compare the performance of the different algorithms, we built two synthetic data sets. Both are based on human intestinal species and pathway information available in an existing biological database. One of the data matrices can be exactly factorized into the underlying matrices used to generate the data. The other data set is built by simulating a sampling process that introduces noise and strictly limits the number of observed reactions per sample. We tested the factorization methods discussed in the thesis on both data sets, using 100 to 1500 samples, and we compare the methods and discuss the results. We found differences between NMF variants that use different objective functions. Many methods perform well on our task, surprisingly even when the number of pathways is greater than the number of samples. Varying the number of samples affected the results less than we expected. Instead, we found that all algorithms performed significantly better on the factorizable data than on the simulated set. We conclude that the number of available metagenomic samples does not dramatically affect the performance of the factorization methods. More important is the quality of the samples.
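
The idea of applying a standard two-factor NMF twice to obtain three factors can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: it assumes the Frobenius-norm multiplicative updates of Lee and Seung as the "standard" NMF, and it assumes that the second factorization is applied to the right-hand factor of the first one. The function names `nmf` and `three_factor_by_two_nmf` and the intermediate matrix `G` are hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Two-factor NMF, V ~ A @ B, using multiplicative updates
    for the Frobenius objective (an assumed choice of objective)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    A = rng.random((n, rank)) + eps
    B = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative by construction.
        B *= (A.T @ V) / (A.T @ A @ B + eps)
        A *= (V @ B.T) / (A @ B @ B.T + eps)
    return A, B


def three_factor_by_two_nmf(V, n_otus, n_pathways, n_iter=500, seed=0):
    """Build V ~ W @ H @ P by running two-factor NMF twice."""
    # Step 1: samples x reactions ~ (samples x OTUs) @ (OTUs x reactions)
    W, G = nmf(V, n_otus, n_iter=n_iter, seed=seed)
    # Step 2: OTUs x reactions ~ (OTUs x pathways) @ (pathways x reactions)
    H, P = nmf(G, n_pathways, n_iter=n_iter, seed=seed + 1)
    return W, H, P
```

Under these assumptions, `W` has one row per sample and one column per OTU, `H` maps OTUs to pathways, and `P` maps pathways to reactions, matching the roles the abstract assigns to the three matrices. No sparsity constraint is applied here.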

    Event representation across genre

    Peer reviewed

    The Usefulness and Development of the My Way Expo: A Survey of Guidance Counselors in Central Finland

    My thesis is a survey of guidance counselors who work in secondary schools, vocational schools and upper secondary schools in Central Finland. My aim was to find out how useful guidance counselors consider the My Way expo and how they would improve it. I collected the material using an electronic questionnaire. The survey was carried out in May 2010. The thesis was commissioned by Nuorten Keski-Suomi ry, which has organized the My Way expo since 2005 together with a team of young students. The purpose of the expo is to bring current information about studying, hobbies and advisory services to young people aged 14 to 22. At the same time, the expo offers information to adults who work with young people. Every year more and more young people are left without a job or a place to study, which in the long run can lead to an accumulation of different kinds of problems. Expos about education and hobbies are one way of preventing such "school dropouts". In recent years, Finland has invested in guidance for young people at the transition between school levels, for example through dedicated projects. Student counseling in schools is also becoming more important. With my thesis, I hope to bring student counseling in schools and the My Way expo closer together.

    Information Extraction and linguistic characteristics of texts : exploring scenarios and text types

    Information Extraction (IE) is the systematic harvesting of information from natural language text and speech into structured form, e.g., into a database, for further downstream use. The most typical use cases are related to media monitoring. Research in IE is driven by the need to find accurate information about a particular topic in massive collections or streams of text. In addition to the traditional methods of evaluation in IE, we introduce a second measure of quality, which indicates the relevance, or usability, of the extracted facts for an end-user. An extracted fact may be correct, but irrelevant from the user's perspective. This dissertation presents work on two problems: (1) porting an IE system from one topic to another, and (2) assessing the user-oriented relevance of results produced by an IE system. Not all tasks are equally amenable to IE, and performance on some tasks remains worse than on others, despite extensive customization. The first part of this study is motivated by the gap between the performance obtained by IE systems on different topics. Our experience with customizing IE confirms the intuition that different domains exhibit different kinds of complexity, e.g., the business-related domain vs. the domain relating to natural events. The underlying reason is the variation in the language that is used to report the topics. The aim of this thesis is to improve IE results by determining which linguistic and structural features should be taken into consideration when customizing an IE system to a new topic. In the process of adapting the IE system to several domains and building their knowledge bases, we analysed the linguistic and structural characteristics of the domains, and the style of reporting. Information extraction is used as a methodological tool for linguistic observation, as it enables us to expose and explore how linguistic variation affects the IE results.
The second part focuses on measuring the relevance of the IE results, that is, how well the extracted information satisfies the user's interest. We identify which linguistic and structural features are useful for improving the performance on these scenarios. It has been observed in other NLP settings that taking such features into account can produce better results. Thus, the findings presented in this work can be beneficial for a variety of approaches to IE, including those based on machine learning techniques.
In this work I examine the linguistic and structural characteristics of different news topics from the perspective of information extraction. Information Extraction (IE) is the identification of predefined information in natural language text. IE differs from Information Retrieval (IR) in that, whereas IR retrieves documents related to a desired topic, IE picks out precise factual information, answering the questions who, what, where and when, from a massive body of text that is usually obtained as input through IR. These answers, or facts, are stored in a database from which they can be put to further use. Typically, information has been extracted from news reports, among other sources, for media monitoring. Documents from the same topic area are usually processed with the same IE application, but the application's knowledge bases, such as ontologies and syntactic-semantic patterns, are always customized separately for each new topic. My research is motivated by the observation that the pattern-based IE application I used was not equally applicable to all of the news topics I examined. Despite extensive customization, the application performed worse on news reporting natural disasters and infectious diseases than on news about corporate personnel changes, investments and new product launches. The topic of the news affects the linguistic expression and style of the reporting, and thereby the IE results.
My research focuses on identifying the linguistic and structural features that hinder or facilitate the customization of an IE application to new topics, and on using these observations to improve IE results. Although the facts produced by an IE application may be correct, not all facts are equally relevant to the person who needs the information. On the basis of my work, I present a set of general and topic-specific features with which the facts identified by an IE application can be classified according to their usefulness. Making use of the observations made in this work can also help improve the results of other IE approaches.

    Unsupervised Discovery of Scenario-Level Patterns for Information Extraction

    Information Extraction (IE) systems are commonly based on pattern matching. Adapting an IE system to a new scenario entails the construction of a new pattern base, a time-consuming and expensive process. We have implemented a system for finding patterns automatically from un-annotated text. Starting with a small initial set of seed patterns proposed by the user, the system applies an incremental discovery procedure to identify new patterns. We present experiments with evaluations which show that the resulting patterns exhibit high precision and recall.

    2010. Assessment of utility in web mining for the domain of public health

    This paper presents ongoing work on the application of Information Extraction (IE) technology to the domain of Public Health, in a real-world scenario. A central issue in IE is the quality of the results. We present two novel points. First, we distinguish between two kinds of criteria for quality: objective criteria, which measure correctness in traditional terms (F-measure, recall and precision), and subjective criteria, which measure the utility of the results to the end-user. Second, to obtain measures of utility, we build an environment that allows users to interact with the system by rating the analyzed content. We then build a classifier that learns from the users' responses to predict the relevance scores for new events. We conduct experiments with learning to predict relevance, and discuss the results and their implications for text mining in the domain of Public Health.
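
For readers unfamiliar with the objective criteria named above, the following minimal sketch shows how precision, recall and the balanced F-measure are commonly computed from a set of extracted facts and a set of gold-standard facts. It is a generic illustration, not the paper's evaluation code; the function name and the example facts are hypothetical, and exact-match comparison of facts is an assumption.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted facts against a gold standard by exact match."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted facts that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold facts that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (disease, location) facts, for illustration only.
extracted = {("cholera", "A"), ("measles", "B"), ("flu", "C"), ("flu", "D")}
gold = {("cholera", "A"), ("measles", "B"), ("ebola", "E")}
p, r, f = precision_recall_f1(extracted, gold)
```

These scores capture correctness only; the utility to the end-user that the abstract describes would need separate, user-provided ratings.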